Before you run the code in this coursebook, it’s a good idea to go through some initial setup. Under the Libraries and Setup tab you’ll find code to initialize our workspace and the libraries we’ll be using for the projects. Make sure these libraries are installed beforehand by referring to the packages listed there. Under the Training Focus tab we outline the syllabus, identify the key objectives and set expectations for each module.

1 Background

1.1 Algoritma

The following coursebook is produced by the team at Algoritma for its Data Science Academy workshops. The coursebook is intended for a restricted audience only, i.e. the individuals and organizations having received this coursebook directly from the training organization. It may not be reproduced, distributed, translated or adapted in any form outside these individuals and organizations without permission.

Algoritma is a data science education center with bootcamp programs offered in:

  • Bahasa Indonesia (Jakarta campus)
  • English (Singapore campus)

1.1.1 Lifelong Learning Benefits

If you’re an active student or an alumnus, you also qualify for all our future workshops, 100% free of charge, as part of your lifelong learning benefits. This is a new initiative to help you gain mastery and advance your knowledge in data visualization, machine learning, computer vision, natural language processing (NLP) and other sub-fields of data science. All workshops conducted by us (from 1-day to 5-day series) are available to you free of charge, and the benefits never expire.

1.1.2 Second Edition

This coursebook was first written in 2017.

This is the second edition, written in late August 2020. Some of the code has been refactored to work with the latest major version of R, version 4.0. I would like to thank the incredible instructor team at Algoritma for their thorough input and assistance in the authoring and reviewing process.

1.2 Libraries and Setup

We’ll set up caching for this notebook, given how computationally expensive some of the code we write can get.

options(scipen = 9999)
rm(list=ls())

You will need to use install.packages() to install any packages that are not already downloaded onto your machine. You then load the package into your workspace using the library() function:

library(ggplot2)
## Warning: package 'ggplot2' was built under R version 4.0.3
library(GGally)
## Warning: package 'GGally' was built under R version 4.0.3
library(ggthemes)
## Warning: package 'ggthemes' was built under R version 4.0.2
library(ggpubr)
library(leaflet)
## Warning: package 'leaflet' was built under R version 4.0.2
library(lubridate)

The data we’ll be working on is a rather recent YouTube Trending Videos dataset1. It has 13,400 records of trending videos between 14th November 2017 and 21st January 2018, and each record carries the following variables:

General information about the video:

  • Trending Date
  • Title (video title)
  • Channel Title
  • Category ID
  • Publish Time
  • Comments Disabled?
  • Ratings Disabled?
  • Video Error or Removed?

Statistics on a particular date:

  • Views
  • Likes
  • Dislikes
  • Comment Count

Illustration of a “Trending” section on YouTube:

1.3 Training Objectives

The primary objective of this course is to provide a fun and hands-on session to help participants gain full proficiency in data visualization systems and tools. You will learn to create compelling narratives by combining charting elements with a rich grammar under the guidance of the lead instructor and our team of teaching assistants.

  • Plotting Essentials

  • Revision: Built-in Plotting Functionalities

  • Revision: Scatterplot, Histogram, Line and Column Bars

  • Axis, Title and Panel Styles

  • Grammar of Graphics

  • ggplot2 Basics

  • Plotting Better

  • Using Themes

  • Multi-dimensional Faceting

  • Visualizing Geo-Spatial Data with leaflet

  • Lattice Plotting system

By the end of the workshop, Academy students can choose to complete either of the Learn-By-Building modules as their graded assignment:

Ready for Publication
Applying what you’ve learned, create a visualization that is polished with the appropriate annotations, aesthetics and some simple commentary. This can be any visualization using the YouTube dataset, but it should communicate a story.

Interactive Map
Create a web page with an interactive map embedded on it. Use a custom icon for the map markers to represent business locations, and show details about each location pin (“markers”) upon user’s interaction.

This graded assignment is worth 2 points.

2 Plotting Essentials

R as a statistical computing environment packs a generous amount of tools allowing us to reshape, clean and visualize our data through its built-in capabilities. In the first part of this coursebook, we’ll take a look at many of these capabilities and learn how to incorporate these into our day-to-day data science work.

In the second part of this coursebook, we’ll shift our focus to ggplot2, a plotting system by Hadley Wickham. As you’ll see in this 3-day workshop, this plotting system is among the most popular visualization tools today because of its power, extensibility and simplicity (an unlikely combination).

To get started with plotting in R, let’s start by reading our data into the environment:

vids <- read.csv("data_input/USvideos.csv")
names(vids)
##  [1] "trending_date"          "title"                  "channel_title"         
##  [4] "category_id"            "publish_time"           "views"                 
##  [7] "likes"                  "dislikes"               "comment_count"         
## [10] "comments_disabled"      "ratings_disabled"       "video_error_or_removed"

Taking a quick peek at the trending_date column reveals that the date values are stored in a year-day-month format, with . (dot) being the delimiter:

head(vids$trending_date)
## [1] "17.14.11" "17.14.11" "17.14.11" "17.14.11" "17.14.11" "17.14.11"

Because these values follow the yy.dd.mm format, our data processing step handles it accordingly through the format argument:

vids$trending_date <- as.Date(vids$trending_date, format="%y.%d.%m")

In real life, most date formats use a combination of the following specifiers:

  • %Y: 4-digit year (1982)
  • %y: 2-digit year (82)
  • %m: 2-digit month (01)
  • %d: 2-digit day of the month (13)
  • %A: weekday (Wednesday)
  • %a: abbreviated weekday (Wed)
  • %B: month (January)
  • %b: abbreviated month (Jan)

You should follow your lead instructor through a few in-classroom practices to increase your familiarity with the as.Date() function, as processing dates is a fairly common step in many data analysis routines:

# Demo: 
as.Date("Aug 30,2020", format = "%b %d,%Y")

# Dive Deeper (complete the following):
as.Date("30aug20", format = ___)
as.Date("2020-08-30", format = ___)
as.Date("08.30.2020", format = ___)

The raw dataset does not carry the proper name for each category, but identifies it by an “id” instead. The following code chunk “switches” each “id” to its category name, then converts the result to a factor:

vids$category_id <- sapply(as.character(vids$category_id), switch, 
                           "1" = "Film and Animation",
                           "2" = "Autos and Vehicles", 
                           "10" = "Music", 
                           "15" = "Pets and Animals", 
                           "17" = "Sports",
                           "19" = "Travel and Events", 
                           "20" = "Gaming", 
                           "22" = "People and Blogs", 
                           "23" = "Comedy",
                           "24" = "Entertainment", 
                           "25" = "News and Politics",
                           "26" = "Howto and Style", 
                           "27" = "Education",
                           "28" = "Science and Technology", 
                           "29" = "Nonprofit and Activism",
                           "43" = "Shows")

vids$category_id <- as.factor(vids$category_id)
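
As an aside (this is an illustrative alternative, not the coursebook’s code), the same id-to-name mapping can be expressed as a named lookup vector, which is vectorized and avoids the sapply call; only a few categories are shown here for brevity:

```r
# A named vector acts as a lookup table: indexing it by the ids (as
# character strings) returns the matching category names.
category_map <- c("1" = "Film and Animation", "2" = "Autos and Vehicles",
                  "10" = "Music", "24" = "Entertainment")

ids <- c("10", "24", "1")
unname(category_map[ids])   # "Music", "Entertainment", "Film and Animation"
```

On the full data this would be `as.factor(unname(category_map[as.character(vids$category_id)]))`, giving the same result in one step.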

With that done, the next thing we need to do is convert the publish_time variable into a date-time class object. Because we’re analyzing trending YouTube videos in the US, it makes sense to use a US timezone such as America/New_York for our analysis:

as.POSIXct(head(vids$publish_time), 
           format="%Y-%m-%dT%H:%M:%S", 
           tz="America/New_York")
## [1] "2017-11-13 17:13:01 EST" "2017-11-13 07:30:00 EST"
## [3] "2017-11-12 19:05:24 EST" "2017-11-13 11:00:04 EST"
## [5] "2017-11-12 18:01:41 EST" "2017-11-13 19:07:23 EST"

Once we’ve done a sanity check, we can go ahead and apply the conversion to the whole column:

vids$publish_time <- as.POSIXct(vids$publish_time, 
           format="%Y-%m-%dT%H:%M:%S", 
           tz="America/New_York")

By this point you may be wondering if this could have been made simpler. Allow me to introduce a popular package by the name of lubridate. Using lubridate, instead of manually specifying the format, you do it “declaratively”, like the following:

date <-  c("20.30.8", "20.31.8")
date2 <- c("30-8-20", "31-8-20")
date3 <- c("2017-11-13 17:13:01 EST")

# base R
as.Date(date, format="%y.%d.%m")
## [1] "2020-08-30" "2020-08-31"
as.Date(date2, format="%d-%m-%y")
## [1] "2020-08-30" "2020-08-31"
as.POSIXct(date3, format="%Y-%m-%d %H:%M:%S", tz="UTC")
## [1] "2017-11-13 17:13:01 UTC"
# lubridate equivalent:
ydm(date)
## [1] "2020-08-30" "2020-08-31"
dmy(date2)
## [1] "2020-08-30" "2020-08-31"
ymd_hms(date3, tz="UTC")
## [1] "2017-11-13 17:13:01 UTC"

Observe how simply lubridate works with dates and times. In fact, when we use ymd, ymd_hms or one of their variants, these functions recognize the patterns and identify the right separators as long as the order of the components is correct. They will even parse correctly when the input mixes differently formatted dates!
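
To see that flexibility in action, here is a small self-contained sketch (not from the original coursebook): ymd() parses year-month-day inputs even when the separator differs from element to element:

```r
library(lubridate)

# The same y-m-d order, three different separators; ymd() handles all of them
mixed <- c("2020-08-30", "2020/08/31", "20200901")
ymd(mixed)
## [1] "2020-08-30" "2020-08-31" "2020-09-01"
```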

Let’s see a few more things that lubridate can do. I’ll subset the data for the most popular trending video (by number of views) and we’ll extract information from the trending_date of the most popular trending video:

most <- vids[vids$views == max(vids$views),]
year(most$trending_date)
## [1] 2017
month(most$trending_date)
## [1] 12
day(most$trending_date)
## [1] 14

We will also go ahead and create three new variables for our data frame, storing the hours, period of the day, and the day of the week of each video at the time of publish:

vids$publish_hour <- hour(vids$publish_time)

pw <- function(x){
    if(x < 8){
      x <- "12am to 8am"
    }else if(x >= 8 & x < 16){
      x <- "8am to 3pm"
    }else{
      x <- "3pm to 12am"
    }  
}

vids$publish_when <- as.factor(sapply(vids$publish_hour, pw))
vids$publish_wday <- as.factor(weekdays(vids$publish_time))
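
As an aside, the same binning can be done without writing a helper function at all: cut() is a vectorized base-R alternative. This is an illustrative sketch rather than the coursebook’s approach; the breaks below reproduce the intervals used by pw():

```r
# (-1, 7] -> hours 0-7, (7, 15] -> hours 8-15, (15, 23] -> hours 16-23
hours <- c(3, 8, 15, 16, 23)
bins <- cut(hours, breaks = c(-1, 7, 15, 23),
            labels = c("12am to 8am", "8am to 3pm", "3pm to 12am"))
as.character(bins)
## [1] "12am to 8am" "8am to 3pm"  "8am to 3pm"  "3pm to 12am" "3pm to 12am"
```

On the full data, cut(vids$publish_hour, ...) would produce the factor directly, with no sapply needed.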

While publish_wday is now a factor, we can also order its levels so our plots display the weekdays in our desired order:

vids$publish_wday <- ordered(vids$publish_wday, levels=c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"))
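
A quick illustration of why this matters (a toy example, not part of the dataset): a plain factor sorts its levels alphabetically, while an ordered factor respects the order we supply:

```r
wd <- factor(c("Tuesday", "Sunday", "Monday"))
levels(wd)   # alphabetical by default: "Monday" "Sunday" "Tuesday"

wd <- ordered(wd, levels = c("Monday", "Tuesday", "Sunday"))
as.character(sort(wd))   # now sorted in our supplied order
## [1] "Monday"  "Tuesday" "Sunday"
```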

We’ll also go ahead and convert some of the variables into numeric variables as and where appropriate:

vids[,c("views", "likes", "dislikes", "comment_count")] <- lapply(vids[,c("views", "likes", "dislikes", "comment_count")], as.numeric)

Hopefully none of the data transformation and cleansing steps above look too unfamiliar to you! If you need a refresher, refer to the Programming for Data Science coursebook - we’re really applying many of the same ideas to a new dataset, so you should feel somewhat comfortable at this point of the course :)

vids has 13,400 records of trending videos, but many videos were trending for several days, so we really only have 2,986 unique videos. On a very broad average, each video was trending for ~4.5 days.

Let’s create a dataframe, call it vids.u, that takes only the first observation of each title within the data. match returns a vector of the positions of the first matches of its first argument in its second:

vids.u <- vids[match(unique(vids$title), vids$title),]
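
To make the match() behaviour concrete, here is a tiny example with made-up titles:

```r
titles <- c("a", "b", "a", "c", "b")
match(unique(titles), titles)
## [1] 1 2 4
# Indexing by these positions keeps only the first row for each title,
# which is exactly what the vids.u subset above does.
titles[match(unique(titles), titles)]
## [1] "a" "b" "c"
```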

We’ll also create one more variable timetotrend to measure the time it takes for a video to become “trending”:

vids.u$timetotrend <- vids.u$trending_date - as.Date(vids.u$publish_time)
vids.u$timetotrend <- as.factor(ifelse(vids.u$timetotrend <= 7, vids.u$timetotrend, "8+"))

The ifelse() command is a very quick way to perform a conditional check, and then do one of two things depending if the conditionals evaluate to TRUE or FALSE. Here’s another example to illustrate this concept better:

hours <- head(vids$publish_hour)
print(hours)
## [1] 17  7 19 11 18 19
ifelse(hours >= 18, "night_shift", "day_shift")
## [1] "day_shift"   "day_shift"   "night_shift" "day_shift"   "night_shift"
## [6] "night_shift"

With these done, we’ll move into the exciting part of this workshop: plotting!

2.1 Base Plotting and Statistical Plots

Statistical plots help us visually inspect our dataset, and there are numerous ways to create them in R. The simplest of these is the plot() function. In the following code we pass it two vectors, a factor and a numeric ratio, and let it pick a sensible plot:

plot(as.factor(vids.u$publish_hour), vids.u$likes/vids.u$views)

The above gives us a boxplot that compares the “likes ratio” across different periods of the day. We want to observe whether there is any correlation between the likes-to-views percentage and the period when the video was published. As expected, we did not find any obvious pattern, in part because this compares the “likes” a video has on the first day of being “trending” against the hour it was published on YouTube. In a sense, any effect the hour variable could have has been adjusted for (or significantly diluted) by the noise between these two events.

plot() knows how to pick sensible defaults based on the input vectors it is given. To illustrate this point, I’ll subset the data to take only trending videos within the Autos, Gaming and Travel categories:

vids.agt <- vids.u[vids.u$category_id == "Autos and Vehicles" |
                     vids.u$category_id == "Gaming" | 
                     vids.u$category_id == "Travel and Events", ]

And notice that as we now call plot with two numeric variables, it creates a scatterplot for us (instead of the boxplot we saw earlier):

plot(vids.agt$likes, vids.agt$dislikes)

We’ll drop the empty levels from our category_id variable, and also create two new variables that measure the likes and dislikes per video view for each observation:

vids.agt$category_id <- factor(vids.agt$category_id)
vids.agt$likesp <- vids.agt$likes/vids.agt$views
vids.agt$dislikesp <- vids.agt$dislikes/vids.agt$views

Our earlier scatterplot really isn’t very informative, or even pleasant to look at. The key, as with data visualization in general, is for our plot to be effective. An effective plot complements how human visual perception works. This scatterplot is ineffective because most of its points crowd into the bottom-left corner; it could communicate more with less visual clutter.

An approach to fixing that is to color the points. In the following code chunk, the first line is identical to the code that produced the scatterplot above except for one addition: the col (color) parameter. We map color to the category so the points are colored accordingly.

I’ve also added a dashed line (lty=2) with a width of 2 (lwd=2) to show the correlation between the likes-to-views and dislikes-to-views ratios.

Finally, I added a legend for our plot to show how the colors of our scatterplot points map to each level of our category_id variable.

Here’s the code:

plot(vids.agt$likesp, vids.agt$dislikesp, col=vids.agt$category_id, pch=19)
abline(lm(vids.agt$dislikesp ~ vids.agt$likesp), col=8, lwd=2, lty=2)
legend("topright", legend=levels(vids.agt$category_id), fill=1:3)

With plot, the default for two numerical variables is to plot a scatterplot, but we can override the default parameters with the type argument. The following code chunk is identical to the one above, except for the type="h" addition:

plot(vids.agt$likesp, vids.agt$dislikesp, col=vids.agt$category_id, type="h")
abline(lm(vids.agt$dislikesp ~ vids.agt$likesp), col=8, lwd=2, lty=2)
legend("topright", legend=levels(vids.agt$category_id), fill=1:3)

Apart from using plot(), we can also create statistical plots using functions such as hist(). hist() takes a numeric vector and creates a histogram:

hist(vids.agt$likesp)

We can additionally use the breaks argument to control the number of bins if we were not satisfied with the default values:

hist(vids.agt$likesp, breaks=20)

Just as we can add an abline onto our plot, we can add graphical elements such as lines onto this histogram too. Let’s do that, and also use the main argument to give our plot a new main title:

hist(vids.agt$likesp, 
     breaks=20, 
     ylim=c(0, 20), 
     col="lightblue", 
     main="Distribution of likes-per-view")
lines(density(vids.agt$likesp), col="darkblue")

While base plots can be very simple to use, they can be effective too. In fact, with proper coloring, annotation and a little care over the aesthetic touches, you can communicate a lot in a graph using just R’s built-in plotting system.

In the following code chunk I subset from vids.agt only the trending videos that have more than 10,000 likes and order them by the likes-to-views variable. I add a new variable, col, to this new dataframe to be used in the following plot:

vids.ags <- vids.agt[vids.agt$likes > 10000, ]
vids.ags <- vids.ags[order(vids.ags$likesp), ]

# create color specifications for our dotchart
vids.ags$col[vids.ags$category_id == "Autos and Vehicles"] <- "goldenrod4"
vids.ags$col[vids.ags$category_id == "Gaming"] <- "dodgerblue4"
vids.ags$col[vids.ags$category_id == "Travel and Events"] <- "firebrick4"

We’re going to create a dot chart (or a Cleveland’s Dot Plot) by graphing the likes to view ratio of each trending video in the vids.ags dataframe, map channel_title to the labels and group these labels by category_id.

dotchart(vids.ags$likesp, labels=vids.ags$channel_title, cex=.7, pch=19, groups=vids.ags$category_id, col=vids.ags$col)

With this we see that, between groups, the likes-to-views proportion (we’ll call it “likeability” from now on) of Autos and Vehicles is rather similar to Travel and Events. However, for gaming videos we see a larger variance: the top video by likeability is close to 0.13, more than six times that of other trending videos in this category. We can also observe the rough mean likeability within groups, as well as between them. We would expect Travel and Events videos to have ~4 likes per 100 views, and Gaming videos to have more than that due to the positive skew we observe above.

Let’s talk about another kind of plot, one that most statisticians find cringeworthy for its undeserved popularity and prevalence in the workplace. Yes, it is the pie chart. In R’s official documentation, the pie chart is criticized as being “a very bad way of displaying information [because] the eye is good at judging linear measures and bad at judging relative areas”. Almost any data that can be represented in a pie chart can be illustrated with a bar chart or dot chart2.

If you insist on creating one, here’s the code (I’ve added some colors to make it easier to get a grasp of the measures):

pie(table(vids.ags$publish_hour), labels=names(table(vids.ags$publish_hour)), col=topo.colors(24))
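
Following the documentation’s advice, the same kind of count data reads far more easily as a dot chart. Here is a small sketch with synthetic counts (the numbers below are made up for illustration, not taken from vids.ags):

```r
# Hypothetical counts of trending videos per publish hour
counts <- c("9" = 4, "12" = 9, "15" = 7, "18" = 12)

# Positions along a common scale are easier to judge than pie angles
dotchart(as.numeric(counts), labels = names(counts), pch = 19,
         xlab = "Number of videos", main = "Videos by publish hour")
```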

2.2 Grammar of Graphics in R

2.2.1 The motivation of ggplot2

ggplot2 was created by Hadley Wickham in 2005 as an implementation of Leland Wilkinson’s Grammar of Graphics. The idea of the Grammar of Graphics is to create a formal model for data visualization by breaking graphics down into components that can be systematically added or subtracted by the end user.

With ggplot2, plots may be created using qplot(), where arguments and defaults are handled similarly to the base plotting system, or through ggplot(), where the user can add or alter plot components layer by layer with a high level of modularity.

The last point is especially important because it allows data scientists to work with plots in a system that breaks up these different tasks. Instead of a huge, conceptually flat list of parameters controlling every aspect of the plot’s final outcome, this system makes plotting a series of distinct tasks, each focused on one aspect of the plot’s final output.

Let us take a look at a simple example, drawing inspiration from the Earthquake incident that happened in the south of Jakarta this week (as of this writing). I’ve created a dataframe called gempa:

gempa <- data.frame(
  x=c(3.5,3,4,4.5,4.1),
  y=c(12,14,12.4,12.5,14), 
  size=c(14,4,4,6,12)
)

And now I’ll create a ggplot object using ggplot(). Given the range of my values, this plot will place those values on each scale (a scale, for now, can be thought of as just one of the two axes). Note that we’re creating a blank plot with no geometry elements (lines, points, etc.) on it yet. We save this object as g:

g <- ggplot(gempa, aes(x = x, y = y))
g

Notice how ggplot() takes two arguments:

  • The data
  • The aes, which allows us to specify our mapping of the x and y variables so they are used accordingly by ggplot

Once we’ve created our ggplot object (we named it g), we can add a layer onto it using a geom_ function. geom is ggplot’s way of handling geometry, i.e. how the data are represented on the plot. To illustrate this idea, let’s add a geom_point and then print the resulting object:

g + geom_point()

A recap of what we’ve done so far:

  • Creating our ggplot graphics object through ggplot()
  • We specify 2 arguments in our call to ggplot; It’s helpful to note that any argument we pass into ggplot() will be used as global options for the plot, i.e. they apply to all layers we add onto that graphics object
  • For the second argument we use the aes() function, allowing us to map variables from our gempa data to aesthetic properties of the plot (in our case we map them to the x and y axis)
  • We tell ggplot how we want the graphic objects to be represented by adding (through the “+” operator) our geom layer. Since we added geom_point, this is equivalent to adding a layer of scatterplot to represent our x and y variables

As we familiarize ourselves with this system, we will learn to use other functions to obtain more precise control over the construction of our plot. These could be native ggplot constructs such as scales, legends, geoms and thematic elements, or additional constructs that work with ggplot through third-party packages. In the following example, we add background_image (from the ggpubr package) to our original plot (g) before adding geom_point on top of the background image layer. Finally, we add the labels for our title and caption using the labs function:

library(png)
jak <- png::readPNG('assets/jakarta.png')
g +
  background_image(jak)+
  geom_point(size=gempa$size, alpha = 0.6, col="red2")+
  labs(title="Disaster Impact Zone, Invasion Area 2018", caption="source: Jakarta Disaster Relief")

Because of this design philosophy, ggplot presents a learning curve that is beginner-friendly and mostly logical. I say beginner-friendly because, as we will see later, all we need to do is master the starting steps first and not worry about polishing. The starting steps are:

  1. ggplot() with data and aesthetics mapping (aes)
  2. Add to (1) a single geom layer

2.4 Learn-by-building Module: Data Visualization

To help us get ready for the learn-by-building assignment, we’ll walk through a simple exercise together. There isn’t any new concept being introduced; rather, this is a start-to-finish recap of the data visualization process.

First, we’ll subset the videos within the News and Politics category and take only the channels that have more than 10 trending videos during the period under study. We run the following code and see that 6 channels / publishers satisfy that condition:

news <- vids.u[vids.u$category_id == "News and Politics", ]
news <- aggregate(trending_date ~ channel_title, news, length)
news <- news[news$trending_date > 10, ]
news <- news[order(news$trending_date, decreasing = T),]
news

We can now use vids.u$channel_title %in% news$channel_title to subset our data source, indicating that we only wish to build our ggplot from channels in the list of 6 news channels above. The rest of the code is relatively straightforward:

ggplot(data=vids.u[vids.u$channel_title %in% news$channel_title,], aes(x=publish_time, y=views))+
  geom_point(aes(col=log(likes/dislikes), size=comment_count))+
  facet_grid(publish_wday~channel_title)+
  scale_color_gradient(low="red3", high="green2")+
  theme(legend.position = "none",
        axis.text.x = element_text(angle = 90, hjust = 1))

We don’t always need a facet plot if the information we wish to convey doesn’t require it. The following plot does not tell us the sentiment or comment count of each video, but it does enough to show each channel’s representation among the trending videos.

news.agg <- vids.u[vids.u$channel_title %in% news$channel_title, ]
news.agg <- aggregate(views ~ channel_title, news.agg, mean)
names(news.agg) <- c("channel_title", "mean_views")
news.agg

The plot:

ggplot(data=vids.u[vids.u$channel_title %in% news$channel_title,], aes(x=publish_time, y=views))+
  geom_hline(data=news.agg, aes(yintercept=mean_views, col=channel_title), linetype=2, alpha=0.8)+
  geom_point(aes(size=likes/views, col=channel_title), stroke=1, alpha=0.85)+
  guides(size=F)+
  labs(title="Clash of the Media Giants", x="Date", y="Video Views", subtitle="Vox leads in terms of quantity (most videos represented) and quality (most views on average)")+
  theme(legend.position = "bottom")

2.4.1 Using Themes

We can spice up our visualizations using another nifty feature of ggplot: themes! I’ve copied and pasted the code from our earlier exercise and added a theme using theme_linedraw(). I invite you to go ahead and swap out the theme and replace it with one of the other themes. Examples:
- theme_calc()
- theme_excel()
- theme_gdocs()
- theme_classic()

# temp1 was built in an earlier exercise; we rebuild it here so the chunk
# runs on its own: the 10 most represented channels among trending videos
temp1 <- as.data.frame(sort(table(vids.u$channel_title), decreasing = TRUE)[1:10])

ggplot(temp1, aes(x=Var1, y=Freq))+
  geom_col()+
  theme_linedraw()+
  labs(title="Most Represented Channels among Trending Videos")+
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Now try one more example: replace theme_wsj with one of the other themes in the following plot. A simple switch of the overall presentation “theme” using a short theme_ function makes it easy to experiment with different aesthetics, and is yet another benefit of using ggplot!

ggplot(data = vids.ags, aes(x=category_id, y=dislikes, size=comment_count/views))+
  geom_jitter(aes(col=publish_wday))+
  theme_wsj(base_size = 8)+
  labs(title = "Trending Videos Between 3 Categories")+
  theme(legend.position = "none")

2.5 Introduction to Leaflet [Optional]

Leaflet is among the most popular JavaScript libraries for interactive maps, used by websites such as The New York Times, The Washington Post, GitHub and Flickr3. The R package leaflet allows us to create Leaflet maps directly from R code. The steps are as follows:

  1. Create a map widget by calling leaflet().
  2. Add layers (i.e., features) to the map by using layer functions (addTiles, addMarkers, addPolygons) to modify the map widget.

Sounds similar enough to the ggplot system? Let’s see a simple example.

I’m going to create two objects to be used for our Leaflet map later. First, an icon! Here I’m using Algoritma’s main icon and saving it to an object called ico. Next, we’ll create loca, a data frame that has two variables (lat and lng) with some randomly generated numbers. The code is straightforward:

set.seed(418)
library(leaflet)

ico <- makeIcon(
    iconUrl = "https://algorit.ma/wp-content/uploads/2017/07/logo_light_trans.png",
    iconWidth=177/2, iconHeight=41/2
)


loca <- data.frame(lat=runif(5, min = -6.24, max=-6.23),
                   lng=runif(5, min=106.835, max=106.85))

And we’ll now create our map:

# create a leaflet map widget
map1 <- leaflet()

# add tiles from open street map
map1 <- addTiles(map1)

# add markers
map1 <- addMarkers(map1, data = loca, icon=ico)

map1
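
As an aside, leaflet re-exports the magrittr pipe (%>%), so the three assignments above are often written as a single chain. The sketch below recreates loca and ico so it runs on its own:

```r
library(leaflet)

set.seed(418)
loca <- data.frame(lat = runif(5, min = -6.24, max = -6.23),
                   lng = runif(5, min = 106.835, max = 106.85))
ico <- makeIcon(
  iconUrl = "https://algorit.ma/wp-content/uploads/2017/07/logo_light_trans.png",
  iconWidth = 177/2, iconHeight = 41/2
)

# Pipe each step into the next instead of reassigning map1 three times
map1 <- leaflet() %>%
  addTiles() %>%
  addMarkers(data = loca, icon = ico)
map1
```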

Suppose we want the end user to be able to click on each of these icons and see a simple pop-up description; we can add that to our map too!

Create the pop-up text:

pops <- c(
    "<h3>Algoritma Main HQ</h3><p>Visit us here!</p>",
    "<strong>Algoritma Business Campus</strong>", 
    "<h3>In-Construction</h3><p>New Secondary Campus</p>",
    "<strong>Secondary Campus</strong>",
    "<strong>The Basecamp (business-school)</strong>"
)

Adding them to our map:

map1 <- leaflet()
map1 <- addTiles(map1)
map1 <- addMarkers(map1, data = loca, icon=ico, popup = pops)

map1

leaflet does a lot more than the simple demonstration above, but since it belongs to the optional part of this coursebook, I’ll leave it up to you, the reader, to further explore its possibilities! While I would recommend using ggplot as the main focus of your graded assignment, I want to leave the choice up to you. Work with your academic mentors to produce a visualization as specified in the learn-by-building module, and good luck!

3 Summary

The coursebook covers many aspects of plotting, including using visualization libraries such as ggplot2, leaflet and a few other supporting libraries. I hope you’ve managed to get a good grasp of the plotting philosophy behind ggplot2, and have built a few visualizations with it yourself!

Happy coding!

Samuel

3.1 Annotations

Cleveland (1985), page 264: “Data that can be shown by pie charts always can be shown by a dot chart. This means that judgements of position along a common scale can be made instead of the less accurate angle judgements.” This statement is based on the empirical investigations of Cleveland and McGill as well as investigations by perceptual psychologists.


  1. This dataset was first contributed under a CC0 public domain license by Mitchell J (datasnaek on Kaggle), and is maintained by other contributors. It has 13,400 records of trending videos between 14th November 2017 and 21st January 2018.↩︎

  2. Full note on pie charts from the official R documentation:
    “Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chart or dot chart is a preferable way of displaying this type of data.”↩︎

  3. Official Documentation, Leaflet↩︎